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Voxels and Dual Contouring 


Posted on May 16, 2016 by dexyfex 
After spending some time researching methods for rendering voxel isosurfaces, I decided to imple- 
ment a dual contouring algorithm — which is the “dual” case of the infamous marching cubes al- 
gorithm. It’s still far from finished but it’s produced some promising results so far... 








[Fuzzy blobs: spheres 
with added ridged noise. Header image: spheres with negative ridged noise. And a planet.] 


Comparing the filled and the wireframe views of the image below can help reveal how the dual con- 
touring is done. For each voxel “cell” (a cube with a voxel point at each corner) in the grid, between 0 
and 3 quads can be generated, depending on which of the corner points are above/below a threshold 
value. If all corners are on the same side of the threshold, no quads are output. 








The output vertex points also need to be moved to an estimate of where the surface would be, based 
on the voxel gradients along each cell edge. This results in the perfectly round spheres above. Prob- 
ably the easiest way to visualise what is happening is to think of the terrain in Minecraft. Basically we 
start out with all those orthogonal cubic faces and then “flatten” the vertices into the smooth surface. 
If this last step weren't done, the output would actually look just like Minecraft. 


Note that the colours used here are the vertex coordinates in the grid, so some “blockiness” is visible 
in the colours due to them not being offset in the same manner as the vertex positions. 


To get it running at an interactive rate, I’ve devised a method of generating up to 128 voxel “chunks” 
(each 32x32x32 voxels) at a time all on the GPU. The results of of the chunk generation are put into 
larger vertex and index buffer caches, holding 256+ chunks (depending on available graphics 
memory). The updating is currently done immediately before rendering. I’m hesitant to introduce a 
chunk streaming approach due to the potential few frames that there would be nothing to display — 
so I’ve sacrificed some “fluidity” here, meaning that as the camera moves/rotates and uncached 
chunks need to be generated, they are done in that frame, there and then, potentially causing the 


frame to take much longer to render. Currently my terrain system does the same thing, in order to be 
able to ensure bits of the terrain don't suddenly blink out of existence before a higher or lower LOD is 
generated. But it's not actually causing a noticeable frame drop yet. In particular if the camera’s velo- 
city is kept small, generally only a small number of chunks need to be updated each frame, which is 
quite fast, and the cache saves having to regenerate everything if the camera is just rotated in circles. 


An outline ot the chunk update process is as follows: (CS: compute shader; GS: geometry shader — 
both using Shader Model 4.0) 


1. 


00 


(CS) Fill a 35x35x35 uint grid of the data, with 16 bits each of isovalue and material data. Cur- 
rently I’m just using static functions but eventually I want to have it generate a surface based off 
my terrain's heightmaps. 


. (CS) “March” the voxel cells and determine the “case index”. I used a method based off this paper 


well bitsmashed, with 4 voxel cells in each uint value that the compute shader outputs. Also in 
there is packed a value for each cell for whether the vertex is required to generate the surface, and 
the quad count for the cell. 


. (CS) “Reduce” the vertex and quad counts along the Z axis, so for each chunk a 32x32 grid of val- 


ues are output, each “pixel” containing the sum of vertex and index counts along each row of 
voxels. 


. (CS) Perform the same again, except collapsing the 32x32 grid into a 1x32 row of sums. 


. (CS) Perform the same yet again, to finally end up with 1 value for each chunk containing the 


number of vertices and quads for each chunk. (16 bits each). 


. (CS) Perform the reverse procedure as the above, but outputting the vertex and quad offsets for 


each row into a 1x32 row of uints. 


. (CS) Continue the “re-expansion” step to get a 32x32 grid of offsets. 
. (CS) Final re-expansion step, resulting in a 32x32x32 grid of vertex indices and quad offsets. 
. (CS) Update the cached “counts” buffer, which contains how many vertices and quads are to be 


rendered for each cached chunk. 


10. 


11. 


12. 


(CS) Update the cached vertices buffer. The CS is run for each cell in each updated chunk. If an in- 
dex is available for output, the vertex position is generated using the “contouring” part of the dual 
contouring algorithm. The vertex must always remain within its cell, and is found simply by lin- 
early interpolating all the cell’s estimated edge intersection points. The index found in the 
reduction/re-expansion phase is used here to perform a scatter write of the vertex data. 

(CS) Update the cached indices buffer. A similar thing is done here as for the vertex buffer above, 
except for the quads in each cell. The only difference here is that each cell could output up to 3 
quads, as opposed to the single vertex above. But that's OK, because the “quad index” generated 
previously will have accounted for that. 

(GS) At this stage, all the data required for rendering has been prepared. To render the quads, a 
geometry shader is run with 8192 input primitives. The vertex ID is simply passed from the vertex 
shader to the GS, and the GS runs using points as input. If the vertex ID arriving in the GS is 
higher than the number of vertices in the chunk (read from the “cached counts” buffer), nothing is 
output. Otherwise, between 1 and 3 quads can be output here. The vertex and index data gener- 
ated in the previous steps are used here to determine the vertices and quads to be output. 


Unfortunately this method is slightly wasteful in that potentially many GS invocations are not out- 
putting anything, and the total number of quads for each chunk does have a limit (albeit an arbitrary 
one designed to reduce this GS wastefulness). But with the shader exiting without doing unnecessary 
calculations, this impact can be almost unnoticeable on a modern GPU, considering they are designed 
to run many millions of invocations of shader programs in each frame. Using this technique also al- 
lows for no communication back from the GPU to the CPU, namely the number of quads required to 
draw each cell. This is important because it avoids stalling the pipeline. If Shader Model 5.0 is used, 
this issue can be removed completely by use of the DrawInstancedIndirect() function, although even 


that still leaves a little to be desired (a draw call would be required for each chunk). With the current 
method, only a single draw call is needed to draw all the chunks. 


There is plenty more to say on this subject, but honestly this is too long already and probably no-one 
will read it anyway — but congrats if you got here! :D. Obviously still on the to-do list is to have differ- 
ent data in each chunk (terrain data, asteroid data, etc), and LODs, which will likely be quite fiddly, 
especially the LOD fading. 


But anyways, I'll finish off with pictures of some spectacular disasters that have occurred during this 
development process. It’s all art, right? © 














